🤖 Machine Learning

Plan of the lecture

  1. What is AI?
  2. What is Machine Learning?
  3. Scikit-learn Library
  4. Linear Regression with Scikit
  5. Generalization - Holdout Bias Variance
  6. Cross Validation

1. What is AI?

Artificial intelligence (AI) is an area of computer science that emphasizes the creation of intelligent machines that work and react like humans.

Artificial intelligence was born in the 1950s, when a handful of pioneers from the nascent field of computer science started asking whether computers could be made to “think”.

♟ ♟ Early Chess Programs ♟ ♟

  • First example of symbolic AI
  • Only involved hard-coded rules crafted by programmers
  • The belief: human-level artificial intelligence could be achieved by having programmers handcraft a sufficiently large set of explicit rules

Could a computer go beyond “what we know how to order it to perform” and learn on its own how to perform a specified task?

Could a computer surprise us? Rather than programmers crafting data-processing rules by hand, could a computer automatically learn these rules by looking at data?

2. What is Machine Learning?

The area of computational science that focuses on analyzing and interpreting patterns and structures in data to enable learning, reasoning, and decision making with minimal human intervention.

Learning the (statistical) patterns that govern a phenomenon.

Machine Learning == Statistical Learning

General programming vs. Machine Learning

An ML system is trained rather than programmed.

It is presented with many examples relevant to a task in order to find the statistical structure in these examples that eventually allows the system to come up with rules for automating the task.
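To make the contrast concrete, here is a hypothetical toy example (not from the lecture; the data and threshold are invented): a hard-coded rule versus a rule whose threshold is learned from examples.

# General programming: a human hard-codes the rule
def is_expensive(living_area):
    return living_area > 2000  # threshold chosen by the programmer

# Machine Learning: the rule is learned from examples
from sklearn.tree import DecisionTreeClassifier

areas = [[800], [1200], [1600], [2400], [3000], [3500]]  # inputs (living area)
labels = [0, 0, 0, 1, 1, 1]                              # expected answers
model = DecisionTreeClassifier(max_depth=1).fit(areas, labels)

model.predict([[2800]])  # => array([1]): the threshold was inferred from the data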

Machine Learning Taxonomy

Supervised Learning

Develop a predictive model based on both input and output data.

Classification vs Regression

Unsupervised Learning

Group and interpret data based only on input data.

Jargon

The features can also be referred to as the input, the X’s, the variables or covariates.

The target can also be referred to as the output, the y, the label, the class or the outcome.

The samples can also be referred to as the rows or the observations.
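As a quick sketch, here is how this jargon maps onto a pandas DataFrame (the column names and values below are placeholders):

import pandas as pd

df = pd.DataFrame({
    "area":  [1710, 1262, 1786],        # a feature / X / variable / covariate
    "rooms": [8, 6, 7],                 # another feature
    "price": [208500, 181500, 223500],  # the target / y / label / outcome
})

X = df[["area", "rooms"]]  # the features (one column per feature)
y = df["price"]            # the target
# Each row of df is one sample / observation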

ML stages

ML domains and tasks

3. Scikit-learn

Scikit-learn (Sklearn) is a Machine Learning library that provides data preprocessing, modeling, and model selection tools.

👉 https://scikit-learn.org

Installing Sklearn

In your terminal, type the following:

pip install -U scikit-learn

👉 Installation Documentation

Sklearn structure

  • Sklearn is organized by modules

  • Each module contains tools in the form of classes

linear_model module

  • linear_model is a module

  • LinearRegression is a class

👉 Sklearn linear_model documentation

Module and Class imports

There are many ways to import modules and classes in notebooks, but there is a best practice.

🚫 import sklearn  # import of entire library
model = sklearn.linear_model.LinearRegression()  # must type library and module prefix every time

🚫 import sklearn.linear_model  # import of entire module
model = sklearn.linear_model.LinearRegression()  # must still type the full prefix every time

🚫 from sklearn import linear_model  # import of entire module
model = linear_model.LinearRegression()  # must type module prefix every time

🚫 from sklearn.linear_model import *  # wildcard import of everything in the module
model = LinearRegression()  # works, but pollutes the namespace and hides where names come from

“Explicit is better than implicit” - The Zen of Python

from sklearn.linear_model import LinearRegression # explicit class import from module
model = LinearRegression() #=> we know where this object comes from

4. Linear Modeling with Sklearn

Consider the following dataset (download here). It is a collection of houses and their characteristics, along with their sale price. The full documentation of the dataset is available here.

import pandas as pd

data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Houses_dataset.csv")

data.head()
   Id  MSSubClass MSZoning  ...  SaleType  SaleCondition SalePrice
0   1          60       RL  ...        WD         Normal    208500
1   2          20       RL  ...        WD         Normal    181500
2   3          60       RL  ...        WD         Normal    223500
3   4          70       RL  ...        WD        Abnorml    140000
4   5          60       RL  ...        WD         Normal    250000

[5 rows x 85 columns]

Let’s start simple by modeling the SalePrice (y) according to the GrLivArea (X).

livecode_data = data[['GrLivArea','SalePrice']]

livecode_data.head()
   GrLivArea  SalePrice
0       1710     208500
1       1262     181500
2       1786     223500
3       1717     140000
4       2198     250000

Exploration

import matplotlib.pyplot as plt

# Plot Living area vs Sale price
plt.scatter(data['GrLivArea'], data['SalePrice'])

# Labels
plt.xlabel("Living area")
plt.ylabel("Sale price")

plt.show()

Training

Training a Linear Regression model with Sklearn LinearRegression

# Import the model
from sklearn.linear_model import LinearRegression

# Instantiate the model (💡 in Sklearn often called "estimator")
model = LinearRegression()

# Define X and y
X = data[['GrLivArea']]
y = data['SalePrice']

# Train the model on the data
model.fit(X, y)
LinearRegression()

👉 Sklearn LinearRegression documentation

At this stage, the model has learned the optimal parameters - slope a and intercept b - needed to map the relationship between X and y.

Model Attributes

a (slope) and b (intercept) are stored as model attributes and can be accessed.

# View the model's slope (a)
model.coef_ 
array([105.00927564])
# View the model's intercept (b)
model.intercept_ 
22104.121010020463
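Since a and b are just numbers, we can recompute a prediction by hand (a quick sanity-check sketch; it matches the model.predict() call shown later):

a = model.coef_[0]    # slope
b = model.intercept_  # intercept

# Manual prediction for a living area of 1000
a * 1000 + b  # => ~127113.4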

Scoring

Each Scikit-learn algorithm has a default scoring metric.

LinearRegression uses the Coefficient of determination (\(R^2\)) by default.

  • \(R^2\) represents the proportion of the variance of the target explained by the features.

  • The score typically varies between 0 and 1 (it can be negative when the model performs worse than always predicting the mean)

  • The higher the score, the better the model

# Evaluate the model's performance
model.score(X,y)
0.48960426399689116
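To demystify the score, here is a sketch computing \(R^2\) from its definition, \(R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}\):

# Recompute R² by hand and compare with model.score(X, y)
y_pred = model.predict(X)            # model predictions
ss_res = ((y - y_pred) ** 2).sum()   # residual sum of squares
ss_tot = ((y - y.mean()) ** 2).sum() # total sum of squares

1 - ss_res / ss_tot  # => 0.4896..., same as model.score(X, y)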

💡 Different models will have different default scoring metrics. You can look them up in the .score() method in the model’s docs.

For example, a classifier like LogisticRegression defaults to accuracy.

Predicting

The trained model can be used to make predictions on new data.

#  Predict on new data
model.predict([[1000]])
array([127113.39664561])

/home/ahmed/.local/lib/python3.10/site-packages/sklearn/base.py:420: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names
  warnings.warn(

👉 An apartment with a surface area of 1000 \(ft^2\) has a predicted value of about $127k.

❗ Note that your X (features) almost always needs to be 2-dimensional (a DataFrame or a 2D array) when passed as an argument to an sklearn API method
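The warning above appears because the model was fitted on a DataFrame (which carries feature names) but given a plain list for prediction. A sketch of the warning-free version, passing a one-row DataFrame with the same column name:

import pandas as pd

# Predict with a DataFrame carrying the original feature name
model.predict(pd.DataFrame({"GrLivArea": [1000]}))  # same result, no warning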

Sklearn modeling flow

  1. Import the model: from sklearn.module import Model

  2. Instantiate the model: model = Model()

  3. Train the model: model.fit(X, y)

  4. Evaluate the model: model.score(new_X, new_y)

  5. Make predictions: model.predict(new_X)

❓ What did we do wrong when scoring the model’s performance?

👉 We scored the model on the same data it was trained on!!

5. Generalization

The performance of a Machine Learning model is evaluated on its ability to generalize when predicting unseen data.

The Holdout Method

The Holdout Method is used to evaluate a model’s ability to generalize. It consists of splitting the dataset into two sets:

  • Training set (~70%)

  • Testing set (~30%)

Example

Imagine our dataset has 9 observations
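A minimal sketch of what such a split looks like, using indices 0-8 to stand in for the 9 observations (random_state is fixed for reproducibility):

from sklearn.model_selection import train_test_split

observations = list(range(9))  # 9 observations: 0, 1, ..., 8

train, test = train_test_split(observations, test_size=0.3, random_state=42)
print(train)  # e.g. [0, 7, 2, 4, 3, 6]  -> ~70% for training
print(test)   # e.g. [8, 1, 5]           -> ~30% for testing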

💻 train_test_split

Let’s model the SalePrice (y) according to the GrLivArea (X) whilst keeping generalization in mind.

livecode_data.head()
   GrLivArea  SalePrice
0       1710     208500
1       1262     181500
2       1786     223500
3       1717     140000
4       2198     250000

Splitting

from sklearn.model_selection import train_test_split

# split the data into train and test
train_data, test_data = train_test_split(livecode_data, test_size=0.3)

# Ready X's and y's
X_train = train_data[['GrLivArea']]
y_train = train_data['SalePrice']

X_test = test_data[['GrLivArea']]
y_test = test_data['SalePrice']

You could also directly pass X and y to train_test_split.

# Ready X and y
X = livecode_data[['GrLivArea']]
y = livecode_data['SalePrice']

# Split into Train/Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

Training and scoring

# Instantiate the model
model = LinearRegression()

# Train the model on the Training data
model.fit(X_train, y_train)

# Score the model on the Test data
model.score(X_test,y_test)
0.5066793488829242

❓ Can you think of any limitations of the Holdout Method?

Data split is random

  • Different random splits will create different results

### RUN THIS CELL MULTIPLE TIMES TO SEE DIFFERENT SCORES

# Split into Train/Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Instantiate the model
model = LinearRegression()

# Train the model on the Training data
model.fit(X_train, y_train)

# Score the model on the Test data
model.score(X_test,y_test)
0.47140651143812695

Loss of information

  • The data in the Test set is not used to train the model

  • If you have a small dataset, that loss could be significant

❓ How would you solve that issue?

👉 Average the scores of multiple holdout splits.

6. K-Fold Cross Validation

  • The dataset is split into K folds
  • For each split, a sub-model is trained and scored
  • The average score of all sub-models is the cross-validated score of the model

Dataframe view
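A sketch of how the folds are laid out, printing the train/test row indices of each split on a tiny 10-sample dataset:

import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(-1, 1)  # 10 tiny samples
kf = KFold(n_splits=5)

# Each fold takes a turn as the test set; the rest is used for training
for i, (train_idx, test_idx) in enumerate(kf.split(X_demo)):
    print(f"Fold {i}: train={train_idx} test={test_idx}")
# Fold 0: train=[2 3 4 5 6 7 8 9] test=[0 1]
# Fold 1: train=[0 1 4 5 6 7 8 9] test=[2 3]
# ...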

💻 cross_validate

from sklearn.model_selection import cross_validate

# Instantiate model
model = LinearRegression()

# 5-Fold Cross validate model
cv_results = cross_validate(model, X, y, cv=5)

# Scores
cv_results['test_score']
array([0.55810657, 0.52593307, 0.50430916, 0.3911751 , 0.45203221])

# Mean of scores
cv_results['test_score'].mean()
0.4863112208425962

Choosing K

  • Choosing K is a tradeoff between trustworthy performance evaluation and computational expense

  • More folds –> more sub-models to train and score –> more representative average score, but more computation time

ℹ Rule of thumb: K=5 or 10

⚠️ Cross validation does not output a trained model; it only scores a hypothetical model trained on the entire dataset.
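In practice, once the cross-validated score is satisfactory, you train the final model on the full dataset yourself:

# cross_validate only scored sub-models; fit the real one on all the data
model = LinearRegression()
model.fit(X, y)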

The Bias / Variance tradeoff

For a model to generalize, there is a tradeoff between bias and variance.

  • Bias (Underfitting): the inability of an algorithm to learn the patterns within a dataset.

  • Variance (Overfitting): the algorithm learns an overly complex relationship when modeling the patterns within a dataset, fitting noise rather than signal.
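To make the tradeoff tangible, here is a sketch (synthetic data, invented parameters) comparing an underfitting and an overfitting model via their train vs. test scores:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic, noisy quadratic data (invented for illustration)
rng = np.random.RandomState(0)
X_syn = np.sort(rng.uniform(-3, 3, 80)).reshape(-1, 1)
y_syn = X_syn.ravel() ** 2 + rng.normal(0, 2, 80)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=0)

for degree in [1, 2, 15]:  # underfit, reasonable, overfit
    pipe = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    pipe.fit(X_tr, y_tr)
    print(degree, pipe.score(X_tr, y_tr), pipe.score(X_te, y_te))
# Expect: degree 1 scores low on both (bias); degree 15 scores high on
# train but lower on test (variance); degree 2 generalizes best.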

No Free Lunch Theorem

Some models oversimplify, while others overcomplicate a relationship between the features and the target.

It’s up to us as data scientists to make assumptions about the data and evaluate reasonable models accordingly.

There is no one-size-fits-all model; this is known as the No Free Lunch Theorem.